Reveal the economy of the city of Engagement, Ohio, USA by using appropriate static and interactive statistical graphics methods for the VAST Challenge 2022.
This take-home exercise aims to reveal the economy of the city of Engagement, Ohio, USA by using visualization techniques in R.
The economic perspective considers the financial health of the city. Over time, are businesses growing or shrinking? How are people changing jobs? Are standards of living improving or declining over time?
Consider the financial status of Engagement’s businesses and residents, and use visual analytic techniques to address these questions.
Over the period covered by the dataset, which businesses appear to be more prosperous? Which appear to be struggling?
Links to the datasets:
CheckinJournal.csv
TravelJournal.csv
Several R packages, notably tidyverse, plotly, ggdist, and treemap, are required for this exercise. The code chunk below installs any missing packages and loads them into the RStudio environment.
packages = c('tidyverse', 'ggplot2', 'ggdist', 'ggridges', 'patchwork', 'ggthemes',
             'hrbrthemes', 'ggrepel', 'ggforce', 'HH', 'vcd', 'scales', 'grid',
             'gridExtra', 'formattable', 'readr', 'ggiraph', 'plotly', 'DT',
             'gganimate', 'readxl', 'gifski', 'gapminder', 'treemap', 'treemapify',
             'rPackedBar', 'ggstatsplot', 'ggside', 'broom', 'crosstalk')
for (p in packages) {
  # Install the package if it is not already available, then load it
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
The datasets used in this exercise are Restaurants.csv, Pubs.csv, and TravelJournal.csv, published by the IEEE for the [VAST Challenge 2022](https://vast-challenge.github.io/2022/).
The code chunk below imports Restaurants.csv, Pubs.csv, and TravelJournal.csv from the data folder into R using the read_csv() function of readr, and saves them as tibble data frames called restaurants, pubs, and travel.
restaurants <- read_csv("data/Restaurants.csv")
summary(restaurants)
restaurantId foodCost maxOccupancy location
Min. : 445.0 Min. :4.070 Min. : 48.00 Length:20
1st Qu.: 783.5 1st Qu.:4.395 1st Qu.: 71.75 Class :character
Median :1122.0 Median :5.095 Median : 85.00 Mode :character
Mean :1123.5 Mean :5.035 Mean : 84.70
3rd Qu.:1462.0 3rd Qu.:5.678 3rd Qu.:103.25
Max. :1805.0 Max. :5.920 Max. :119.00
buildingId
Min. : 27.0
1st Qu.:151.0
Median :294.5
Mean :432.9
3rd Qu.:754.2
Max. :991.0
pubs <- read_csv("data/Pubs.csv")
summary(pubs)
pubId hourlyCost maxOccupancy location
Min. : 442 Min. : 6.417 Min. :60.00 Length:12
1st Qu.: 780 1st Qu.: 9.725 1st Qu.:64.00 Class :character
Median :1118 Median :11.035 Median :69.50 Mode :character
Mean :1120 Mean :10.866 Mean :71.83
3rd Qu.:1458 3rd Qu.:12.379 3rd Qu.:77.50
Max. :1800 Max. :14.840 Max. :96.00
buildingId
Min. : 29.0
1st Qu.: 237.0
Median : 495.5
Mean : 484.8
3rd Qu.: 595.5
Max. :1012.0
travel <- read_csv("data/TravelJournal.csv")
summary(travel)
Before we proceed with data visualization, we check whether the chosen datasets contain any null values using the code below.
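The original check chunk is omitted here; a minimal sketch that produces one TRUE/FALSE flag per column (assuming base R's `anyNA()` as the check) is:

```r
# Flag each column that contains missing values (TRUE = column has NAs)
sapply(restaurants, anyNA)
sapply(pubs, anyNA)
```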
restaurantId foodCost maxOccupancy location buildingId
FALSE FALSE FALSE FALSE FALSE
pubId hourlyCost maxOccupancy location buildingId
FALSE FALSE FALSE FALSE FALSE
Since no null values are observed, we proceed with the next step.
The Travel Journal contains financial transactions by participants for Work/Home Commute, Eating, Coming Back From Restaurant, Recreation (Social Gathering), and Going Back to Home. We retain only the records related to Eating and Recreation (Social Gathering).
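The filtering step that produces `travel_filt` is not shown in the original; a minimal sketch, assuming the purpose labels match those listed above, is:

```r
library(dplyr)

# Retain only trips made for dining or social recreation
travel_filt <- travel %>%
  filter(purpose %in% c("Eating", "Recreation (Social Gathering)"))
```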
We calculate the amount spent at each location as the difference between the starting balance and the ending balance in the travel journal.
travel_filt$amountSpent <- travel_filt$startingBalance - travel_filt$endingBalance
saveRDS(travel_filt, 'data/travel_filt.rds')
# A tibble: 6 × 11
participantId travelStartTime travelStartLocationId
<dbl> <dttm> <dbl>
1 23 2022-03-01 05:20:00 532
2 876 2022-03-01 05:50:00 NA
3 902 2022-03-01 06:05:00 NA
4 919 2022-03-01 06:00:00 NA
5 154 2022-03-01 05:55:00 NA
6 509 2022-03-01 06:00:00 NA
# … with 8 more variables: travelEndTime <dttm>,
# travelEndLocationId <dbl>, purpose <chr>, checkInTime <dttm>,
# checkOutTime <dttm>, startingBalance <dbl>, endingBalance <dbl>,
# amountSpent <dbl>
We then group the data by travelEndLocationId, which corresponds to the restaurant ID or the pub ID, and sum the amount spent at each location.
travel_group <- travel_filt %>%
  group_by(travelEndLocationId) %>%
  summarise(amountSpent = sum(amountSpent),
            .groups = 'drop') %>%
  arrange(desc(amountSpent))
library(dplyr)

# Join total spending to restaurant attributes on the location/restaurant ID
data_merge <- merge(x = travel_group, y = restaurants,
                    by.x = 'travelEndLocationId', by.y = 'restaurantId')
data_merge$travelEndLocationId <- as.character(data_merge$travelEndLocationId)

# Express revenue in thousands of dollars
data_merge$amountSpent <- data_merge$amountSpent / 1000
The plot below is a basic bar chart with plain X and Y axes and is not particularly visually appealing. Applying the principles of data visualization, it is then refined into an interactive version in which the most and least prosperous businesses are highlighted in color, making the differences in revenue easily visible to the reader.
ggplot(data = data_merge,aes(x = travelEndLocationId, y=amountSpent)) +
geom_col()+
xlab("Business ID") +
ylab("Revenue (in thousands $)")
library(plotly)
# Highlight selected restaurants in red; show the rest in grey
color <- c('rgba(222,45,38,0.8)', rep('rgba(204,204,204,1)', 10),
           rep('rgba(222,45,38,0.8)', 4), rep('rgba(204,204,204,1)', 5))
fig<- plot_ly(data_merge, x = ~reorder(`travelEndLocationId`, -amountSpent), y = ~amountSpent, type = 'bar', marker = list(color = color))
fig <- fig %>% layout(title = "Revenue for Restaurants",
                      xaxis = list(title = "Restaurant ID"),
yaxis = list(title = "Revenue (in thousands $)"))
fig
data_pub <-merge(x=travel_group, y=pubs, by.x = "travelEndLocationId", by.y = "pubId")
data_pub$travelEndLocationId <- as.character(data_pub$travelEndLocationId)
data_pub$amountSpent <- data_pub$amountSpent/1000
library(plotly)
# Highlight the first three pubs (highest revenue) in red; show the rest in grey
color <- c(rep('rgba(222,45,38,0.8)', 3), rep('rgba(204,204,204,1)', 9))
fig<- plot_ly(data_pub, x = ~reorder(`travelEndLocationId`, -amountSpent), y = ~amountSpent, type = 'bar', marker = list(color = color))
fig <- fig %>% layout(title = "Revenue for Pubs",
xaxis = list(title = "Pub ID"),
yaxis = list(title = "Revenue (in thousands $)"))
fig
treemap(data_merge,
index=c("travelEndLocationId"),
vSize="amountSpent",
vColor="amountSpent",
title="Amount Spent in thousands of Dollars - Restaurants",
title.legend = "Amount Spent in thousands of Dollars - Restaurants"
)
library(ggdist)
travel_filt %>%
ggplot(aes(x = purpose,
y = amountSpent)) +
stat_gradientinterval(
fill = "skyblue",
show.legend = TRUE
) +
theme_bw()+
scale_y_continuous(breaks = seq(0, 35, by = 5),
limits = c(0, 35))+
labs(
    y = 'Revenue',
    x = 'Purpose',
    title = "Revenue generated by Purpose",
    subtitle = "Gradient + interval plot",
    caption = "Economic information, Ohio USA"
  )+
theme(
    axis.title.y = element_text(angle = 0),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 1)
  )